Exclude DataNodes being removed from new Region allocation#17934
Conversation
When a `remove datanode` is in progress, the ConfigNode could still allocate brand-new Region replicas onto the DataNode being removed. This was especially likely when the target DataNode had been killed (e.g. `kill -9`) before removal: the failure detector reports such a node as `Unknown` rather than `Removing`, and `RegionBalancer` intentionally keeps `Unknown` DataNodes as allocation candidates (to cope with insufficient online nodes). The new replica could never be created on the dead node (Connection refused), yet the metadata kept the assignment and retried forever, so the removal hung and the target DataNode never disappeared from `show datanodes`. A node-status filter alone cannot fix this, because the killed node is `Unknown`, not `Removing`. Instead, `RegionBalancer` now consults the in-progress `RemoveDataNodesProcedure` (the authoritative, leader-switch durable source of which DataNodes are being removed) via the new `ProcedureManager.getRemovingDataNodeIds()` and drops those DataNodes from the allocation candidates. This mirrors the existing filtering in `RemoveDataNodeHandler.selectedRegionMigrationPlans`. Add IoTDBRemoveDataNodeRegionAllocationIT: it kills a DataNode, submits the removal, and while it is in progress forces a fresh Region allocation via a new database, asserting that none of the newly allocated Regions land on the DataNode being removed and that the removal completes.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #17934 +/- ##
============================================
- Coverage 41.07% 41.06% -0.01%
Complexity 318 318
============================================
Files 5257 5257
Lines 365010 365023 +13
Branches 47180 47180
============================================
- Hits 149918 149910 -8
- Misses 215092 215113 +21 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Caideyipi
left a comment
There was a problem hiding this comment.
Reviewed the change and did not find any blocking issue. The fix correctly excludes DataNodes covered by an unfinished RemoveDataNodesProcedure from new Region allocation while preserving the existing Unknown-node candidate behavior.
There was a problem hiding this comment.
Pull request overview
This PR prevents ConfigNode from allocating new Region replicas onto DataNodes that are currently being removed (even if they appear as Unknown due to heartbeat loss), avoiding stranded replicas that can stall RemoveDataNodesProcedure indefinitely.
Changes:
- Add
ProcedureManager.getRemovingDataNodeIds()to derive the authoritative “removing DataNode” set from in-progressRemoveDataNodesProcedureinstances. - Update
RegionBalancer.genRegionGroupsAllocationPlanto filter out those removing DataNodes from region-allocation candidates while still allowingUnknownnodes in general. - Add an integration test that kills a DataNode, starts removal, forces new Region allocation, and asserts newly allocated Regions do not land on the removing DataNode and that removal completes.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/manager/ProcedureManager.java | Exposes removing-DataNode IDs by scanning in-progress RemoveDataNodesProcedures. |
| iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/manager/load/balancer/RegionBalancer.java | Excludes removing DataNodes from allocation candidates to prevent allocating new replicas onto nodes being removed. |
| integration-test/src/test/java/org/apache/iotdb/confignode/it/removedatanode/IoTDBRemoveDataNodeRegionAllocationIT.java | Adds regression coverage for “new Region allocation during remove datanode” scenario. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
The test re-opened a multi-node connection after kill -9'ing the target DataNode. EnvFactory.getConnection() fans every read out to all DataNodes including the dead one, so each executeQuery returned [Connection Error, null, null, null] and failed with InconsistentDataException at the connection site rather than exercising the bug. Pin both the post-kill connection and the allocation-probe connection to a surviving DataNode via a new BaseEnv.getConnection(DataNodeWrapper) overload.
|
Fixed the CI failure in `IoTDBRemoveDataNodeRegionAllocationIT` (e75e196). Root cause: after `kill -9`-ing the target DataNode, the test re-opened a multi-node `EnvFactory.getConnection()`. That connection fans every read out to all DataNodes, including the dead one, so each `executeQuery` (`SHOW DATANODES`, region maps) returned `[Connection Error, null, null, null]` and failed with `InconsistentDataException` at the connection site — before the test could exercise the allocation behaviour. Fix: pin all post-kill SQL (both the main statement and the allocation-probe connection) to a surviving DataNode via a new `BaseEnv.getConnection(DataNodeWrapper)` overload, which scopes write and read to a single live node. This matches the existing framework intent ("This is useful when you shut down a dataNode"). The remaining CI reds (`IoTDBConnectionsIT` on Windows, `IoTDBPipeClusterIT` on dual-table-manual-enhanced) are unrelated pre-existing flakes — they touch the session and pipe subsystems, not region allocation. |
With Ratis schema consensus and schemaReplicationFactor=2, kill -9'ing one of the two schema-replica holders breaks Ratis quorum: the schema-region migration off the dead node fails (DO_ADD_REGION_PEER times out), so RemoveDataNodesProcedure rolls back and the node never leaves the cluster, making awaitUntilSuccess time out after 5 minutes. Raise schemaReplicationFactor to 3 (matching the proven IoTDBRemoveUnknownDataNodeIT#successTest config) so one kill still leaves a quorum (2 of 3) and the removal can actually complete. With 4 DataNodes, removing 1 leaves exactly MINIMUM_DATANODE = max(schemaRF, dataRF) = 3, so the removal is still permitted and the allocation candidate pool (the 3 survivors) is large enough for the forced new RF=3 schema / RF=2 data regions.
|
Second CI fix (9fb2e92). After the connection fix, the IT got past the connection step and the core assertion (`assertNewRegionsExcludeDataNode` — new Regions exclude the removing node) passed, but it then timed out at `awaitUntilSuccess`: the removal never completed. Root cause (from the ConfigNode cluster logs): schema Regions use Ratis, and with `schemaReplicationFactor=2`, `kill -9`-ing one of the two replica holders breaks Ratis quorum. The schema-Region migration off the dead node (`AddRegionPeerProcedure.DO_ADD_REGION_PEER`) times out because the old 2-peer config (one dead) can't reach majority, so `RemoveDataNodesProcedure` rolls back (`migratedFailedRegions:[SchemaRegion id:0]`) and the node stays in the cluster. This is correct product behaviour — it was a test-config mistake. Fix: raise `schemaReplicationFactor` to 3 (matching the proven `IoTDBRemoveUnknownDataNodeIT#successTest` config). One kill then leaves a 2-of-3 quorum so the schema Region can migrate and the removal completes. With 4 DataNodes, removing 1 leaves exactly `MINIMUM_DATANODE = max(schemaRF, dataRF) = 3`, so removal is still permitted and the candidate pool (3 survivors) is large enough for the forced new RF=3 schema / RF=2 data Regions. |
|



Problem
When a
remove datanodeis in progress, the ConfigNode could still allocate brand-new Region replicas onto the DataNode being removed.This is especially likely when the target DataNode was killed (e.g.
kill -9) before the removal: the failure detector reports such a node asUnknownrather thanRemoving, andRegionBalancerintentionally keepsUnknownDataNodes as allocation candidates (to cope with an insufficient number of online nodes). As a result, a new region (e.g. for the internalroot.__systemdatabase) could be assigned a replica on the dead node. That replica can never be created there (Connection refused), yet the metadata keeps the assignment and retries forever, so theRemoveDataNodesProceduregets stuck and the target DataNode never disappears fromshow datanodes.Observed timeline (from the report):
Root cause
RegionBalancer.genRegionGroupsAllocationPlangathers allocation candidates with:A node-status filter alone cannot fix this: a DataNode killed before removal is
Unknown(notRemoving), becauseDataNodeHeartbeatCache.updateCurrentStatisticslets the failure detector override theRemovingstatus back toUnknownonce heartbeats stop. So the removing node still passes the filter.Fix
Consult the in-progress
RemoveDataNodesProcedure— the authoritative, leader-switch-durable source of which DataNodes are being removed — and exclude those nodes from the allocation candidates.ProcedureManager.getRemovingDataNodeIds(), which scans the unfinishedRemoveDataNodesProcedure(s) and returns the removing DataNode ids. This mirrors the existing pattern inProcedureManager.checkRemoveDataNodes/checkRegionOperationWithRemoveDataNode.RegionBalancer.genRegionGroupsAllocationPlannow filters those ids out of the candidate list, in the same spirit as the existingRemoveDataNodeHandler.selectedRegionMigrationPlans.This keeps the existing "allow
Unknowncandidates when online nodes are scarce" behavior intact, and only removes nodes that are actively being removed. The removal procedure already guarantees (checkEnoughDataNodeAfterRemoving) that enough non-removing nodes remain, so this never spuriously triggersNotEnoughDataNodeException.Test
IoTDBRemoveDataNodeRegionAllocationITkills a DataNode that hosts regions, submits the removal, and — while the removal is still in progress — forces a fresh Region allocation by creating a new database. It asserts that none of the newly allocated regions land on the DataNode being removed (comparing against a pre-allocation snapshot of region ids, since the removing node legitimately keeps hosting its own pre-existing regions until each finishes migrating away), and that the removal then completes.🤖 Generated with Claude Code